fix: SDK-4475 wait for in-flight init in initWithContextSuspend to avoid SessionService NPE#2637
fix: SDK-4475 wait for in-flight init in initWithContextSuspend to avoid SessionService NPE#2637abdulraqeeb33 wants to merge 3 commits intomainfrom
Conversation
…oid SessionService NPE
Under SDK_BACKGROUND_THREADING, the public initWithContext(context, appId)
runs internalInit() on a fire-and-forget IO coroutine. The internal-only
suspend overload initWithContext(context) -- used by SyncJobService.onStartJob
-- was returning true as soon as initState was IN_PROGRESS, even though
bootstrap() had not yet populated SessionService.session. SyncJobService
would then call runBackgroundServices(), which invokes
SessionService.backgroundRun() -> endSession() and NPEs on session!!.isValid.
Fix the suspend overload to honor its documented contract ("Remain suspend
until initialization is fully completed"): when init is already in flight,
suspend on suspendCompletion.await() until it actually completes, then
return based on the final state. The public sync overload is unchanged --
host app's MainApplication.onCreate() still returns immediately, no ANR
risk re-introduced.
Also add defensive null guards to SessionService.endSession / onFocus /
onUnfocused / startTime / scheduleBackgroundRunIn so any future caller
that bypasses bootstrap() no-ops instead of crashing.
Adds a regression test (SDKInitTests) that stalls a first
initWithContextSuspend inside internalInit (via a custom
BlockingPrefsContext that signals when prefs are first touched), kicks off
a second initWithContextSuspend, and asserts that os.isInitialized is true
at the moment the second call returns. Verified to fail on main and pass
with the fix.
Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Pull request overview
This PR addresses a race condition under SDK_BACKGROUND_THREADING where a re-entrant suspend initialization (initWithContextSuspend) could return before bootstrap completed, allowing background services (notably SessionService) to run with uninitialized/null internal models.
Changes:
- Updated
OneSignalImp.initWithContextSuspendto await in-flight initialization completion before returning a result. - Added a regression test that reproduces the re-entrant init race and asserts the suspend init does not return early.
- Added defensive null-guards in
SessionServiceto avoid NPEs when invoked beforebootstrap().
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| OneSignalSDK/onesignal/core/src/main/java/com/onesignal/internal/OneSignalImp.kt | Makes re-entrant suspend init wait for completion instead of returning during IN_PROGRESS. |
| OneSignalSDK/onesignal/core/src/main/java/com/onesignal/session/internal/session/impl/SessionService.kt | Adds pre-bootstrap null handling to prevent crashes when background services run too early. |
| OneSignalSDK/onesignal/core/src/test/java/com/onesignal/core/internal/application/SDKInitTests.kt | Adds a regression test to catch early-return behavior during in-flight initialization. |
Comments suppressed due to low confidence (1)
OneSignalSDK/onesignal/core/src/main/java/com/onesignal/internal/OneSignalImp.kt:662
- If
internalInit(...)throws (or any unexpected exception occurs beforenotifyInitComplete()),initStatecan remainIN_PROGRESSandsuspendCompletionmay never complete. With the newsuspendCompletion.await()path, this can hang re-entrant init callers indefinitely. Wrap the init execution soinitStateis set toFAILEDand the completion signal is completed in afinallyblock (capturing the exception intoinitFailureExceptionas appropriate).
if (!shouldRunInit) {
// Another caller has already started (or completed) init. Honor this method's
// contract by suspending until initialization is *fully* completed -- not just
// kicked off. This closes a race where re-entrant suspend callers (e.g. the
// SyncJobService entry point under SDK_BACKGROUND_THREADING) would otherwise
// proceed to use IBackgroundService implementations like SessionService whose
// bootstrap() had not yet run, NPE'ing on still-null model fields.
Logging.log(LogLevel.DEBUG, "initWithContext: init already in progress or completed, awaiting completion")
suspendCompletion.await()
return@withContext initState == InitState.SUCCESS
}
val result = internalInit(context, appId)
// initState is already set correctly in internalInit, no need to overwrite it
result
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| val blockingCtx = object : ContextWrapper(context) { | ||
| override fun getSharedPreferences(name: String, mode: Int): SharedPreferences { | ||
| started.countDown() | ||
| trigger.await() | ||
| return super.getSharedPreferences(name, mode) |
There was a problem hiding this comment.
Good catch — addressed in 524c45b. Bounded trigger.await(30, TimeUnit.SECONDS) inside the ContextWrapper.getSharedPreferences() overrides for both this test and the original in-flight init test, so a logic bug or cancelled coroutine can't leave the IO worker blocking forever and stalling the suite.
| // Sanity: the second caller has not pre-empted the test by returning before | ||
| // we unblock the first caller (timing depends on lazy ServiceProvider locks). | ||
| Thread.sleep(200) | ||
|
|
||
| // Unblock the first caller so internalInit() can complete (state -> SUCCESS). | ||
| trigger.countDown() |
There was a problem hiding this comment.
Acknowledged. The 200ms sleep is intentionally testing a negative assertion — "the second caller did NOT return within window X" — which collapses to the same wall-clock semantics whether expressed as Thread.sleep, delay(), or polling, since polling for an event that should never fire is just a sleep with extra steps. The flake risk is only on the upper bound (we picked 200ms, the actual window can be much smaller), and the bounded trigger.await(30s) from 524c45b ensures the test fails-fast rather than hanging if 200ms is ever insufficient.
Happy to switch to delay() for stylistic consistency with coroutine-based tests if you'd prefer — the semantic is identical.
| ) : ISessionService, IBootstrapService, IStartableService, IBackgroundService, IApplicationLifecycleHandler { | ||
| override val startTime: Long | ||
| get() = session!!.startTime | ||
| get() = session?.startTime ?: 0L |
There was a problem hiding this comment.
Fixed in 524c45b. startTime now defaults to _time.currentTimeMillis when bootstrap hasn't run, so call sites computing now - startTime (IAM session-duration in InAppMessagesManager and SESSION_TIME triggers in DynamicTriggerController) see ~0ms elapsed instead of ~58 years (Jan 1970).
| val shouldRunInit: Boolean | ||
| synchronized(initLock) { | ||
| if (initState.isSDKAccessible()) { | ||
| Logging.log(LogLevel.DEBUG, "initWithContext: SDK already initialized or in progress") | ||
| return@withContext true | ||
| shouldRunInit = !initState.isSDKAccessible() | ||
| if (shouldRunInit) { | ||
| initState = InitState.IN_PROGRESS | ||
| } | ||
| } |
There was a problem hiding this comment.
Addressed in 4b5d86f. suspendCompletion is now a @Volatile var reset under synchronized(initLock) whenever state flips into IN_PROGRESS — applied to both the sync initWithContext(ctx, appId) and suspend initWithContextSuspend paths to keep them consistent. Both await sites (initWithContextSuspend and waitUntilInitInternal) now local-capture the deferred under the lock so they wait on the same generation they observed.
Added regression test initWithContextSuspend resets latch on retry-after-FAILED which: forces a FAILED first init, kicks off a stalled retry, then a re-entrant suspend caller. Verified to fail on the un-hardened code with expected:<true> but was:<false> (the re-entrant caller wakes on the stale latch and reads transient IN_PROGRESS state) and pass with the fix.
| Logging.log(LogLevel.DEBUG, "initWithContext(context: $context, appId: $appId)") | ||
|
|
||
| // This ensures the stack trace points to the caller that triggered init, not the async worker thread. | ||
| initFailureException = IllegalStateException("OneSignal initWithContext failed.") | ||
|
|
||
| // Use IO dispatcher for initialization to prevent ANRs and optimize for I/O operations | ||
| return withContext(runtimeIoDispatcher) { | ||
| // do not do this again if already initialized or init is in progress | ||
| val shouldRunInit: Boolean | ||
| synchronized(initLock) { | ||
| if (initState.isSDKAccessible()) { | ||
| Logging.log(LogLevel.DEBUG, "initWithContext: SDK already initialized or in progress") | ||
| return@withContext true | ||
| shouldRunInit = !initState.isSDKAccessible() | ||
| if (shouldRunInit) { | ||
| initState = InitState.IN_PROGRESS | ||
| } | ||
| } |
There was a problem hiding this comment.
Fixed in 524c45b. initFailureException is now only assigned inside the shouldRunInit == true branch of the synchronized(initLock) block, so re-entrant callers (the SyncJobService entry point in particular) no longer overwrite the original initiator's failure-attribution stack trace.
Addresses review feedback on PR #2637 (claude[bot] + Copilot AI): Defect 1 (stale latch on retry-after-FAILED): suspendCompletion was a single-shot `val CompletableDeferred<Unit>`. Once any init terminated -- including FAILED -- the deferred stayed permanently complete. A re-entrant suspend caller arriving DURING a subsequent retry would `await()` on the already-completed deferred, return instantly, and read transient initState (likely IN_PROGRESS -> false), silently dropping JobService work. Same flaw also affected `waitUntilInitInternal`. Make suspendCompletion mutable (`@Volatile var`) and reset it under `synchronized(initLock)` whenever state flips into IN_PROGRESS (both sync `initWithContext(ctx, appId)` and suspend `initWithContextSuspend`). Both await sites local-capture the deferred under the lock so they wait on the same generation they observed -- never on a stale one. Defect 2 (indefinite hang if internalInit throws): internalInit had no try/catch wrapping its body. An unchecked throw from initEssentials/bootstrapServices/subscribeToConfigStore/updateConfig/ userSwitcher.initUser/startupService.scheduleStart would leave initState at IN_PROGRESS and suspendCompletion uncompleted forever. With the new await() path introduced in PR #2637, every re-entrant suspend caller (SyncJobService) would hang on await() until the OS killed the worker. Wrap internalInit's body in try/catch: on any throw, attach the cause to initFailureException, set initState = FAILED, call notifyInitComplete(), and return false instead of rethrowing. Guarantees a terminal state and released waiters on every code path. Two new regression tests in SDKInitTests: - "resets latch on retry-after-FAILED": stalls a retry after a forced FAILED first init, kicks off a re-entrant caller, asserts it doesn't wake on the stale latch (verified to fail on the un-hardened code with expected:<true> but was:<false>). - "reaches terminal state when internalInit throws": forces a throw inside internalInit, asserts the suspend init returns false and the SDK can retry cleanly afterwards (verified to fail on the un-hardened code with the throw escaping as RuntimeException). Co-authored-by: Cursor <cursoragent@cursor.com>
- SessionService.startTime: return `_time.currentTimeMillis` (~0ms elapsed) instead of `0L` (~58 years elapsed) when bootstrap hasn't run, so call sites computing `now - startTime` (IAM session-duration / SESSION_TIME triggers) get a sensible default rather than Jan 1970 deltas. (Copilot review comment) - initWithContextSuspend: only assign `initFailureException` when the call actually starts init (`shouldRunInit == true`). Re-entrant callers no longer overwrite the original initiator's failure-attribution stack trace. (Copilot review comment) - SDKInitTests: bound the `trigger.await()` inside the in-test ContextWrapper overrides to 30s so that a logic bug elsewhere in the test (or a cancelled coroutine) can't deadlock the IO worker forever and stall the suite. (Copilot review comment) Co-authored-by: Cursor <cursoragent@cursor.com>
Addresses review feedback on PR #2637 (claude[bot] + Copilot AI): Defect 1 (stale latch on retry-after-FAILED): suspendCompletion was a single-shot `val CompletableDeferred<Unit>`. Once any init terminated -- including FAILED -- the deferred stayed permanently complete. A re-entrant suspend caller arriving DURING a subsequent retry would `await()` on the already-completed deferred, return instantly, and read transient initState (likely IN_PROGRESS -> false), silently dropping JobService work. Same flaw also affected `waitUntilInitInternal`. Make suspendCompletion mutable (`@Volatile var`) and reset it under `synchronized(initLock)` whenever state flips into IN_PROGRESS (both sync `initWithContext(ctx, appId)` and suspend `initWithContextSuspend`). Both await sites local-capture the deferred under the lock so they wait on the same generation they observed -- never on a stale one. Defect 2 (indefinite hang if internalInit throws): internalInit had no try/catch wrapping its body. An unchecked throw from initEssentials/bootstrapServices/subscribeToConfigStore/updateConfig/ userSwitcher.initUser/startupService.scheduleStart would leave initState at IN_PROGRESS and suspendCompletion uncompleted forever. With the new await() path introduced in PR #2637, every re-entrant suspend caller (SyncJobService) would hang on await() until the OS killed the worker. Wrap internalInit's body in try/catch: on any throw, attach the cause to initFailureException, set initState = FAILED, call notifyInitComplete(), and return false instead of rethrowing. Guarantees a terminal state and released waiters on every code path. Two new regression tests in SDKInitTests: - "resets latch on retry-after-FAILED": stalls a retry after a forced FAILED first init, kicks off a re-entrant caller, asserts it doesn't wake on the stale latch (verified to fail on the un-hardened code with expected:<true> but was:<false>). - "reaches terminal state when internalInit throws": forces a throw inside internalInit, asserts the suspend init returns false and the SDK can retry cleanly afterwards (verified to fail on the un-hardened code with the throw escaping as RuntimeException). Co-authored-by: Cursor <cursoragent@cursor.com>
- SessionService.startTime: return `_time.currentTimeMillis` (~0ms elapsed) instead of `0L` (~58 years elapsed) when bootstrap hasn't run, so call sites computing `now - startTime` (IAM session-duration / SESSION_TIME triggers) get a sensible default rather than Jan 1970 deltas. (Copilot review comment) - initWithContextSuspend: only assign `initFailureException` when the call actually starts init (`shouldRunInit == true`). Re-entrant callers no longer overwrite the original initiator's failure-attribution stack trace. (Copilot review comment) - SDKInitTests: bound the `trigger.await()` inside the in-test ContextWrapper overrides to 30s so that a logic bug elsewhere in the test (or a cancelled coroutine) can't deadlock the IO worker forever and stall the suite. (Copilot review comment) Co-authored-by: Cursor <cursoragent@cursor.com>
Addresses review feedback on PR #2637 (claude[bot] + Copilot AI): Defect 1 (stale latch on retry-after-FAILED): suspendCompletion was a single-shot `val CompletableDeferred<Unit>`. Once any init terminated -- including FAILED -- the deferred stayed permanently complete. A re-entrant suspend caller arriving DURING a subsequent retry would `await()` on the already-completed deferred, return instantly, and read transient initState (likely IN_PROGRESS -> false), silently dropping JobService work. Same flaw also affected `waitUntilInitInternal`. Make suspendCompletion mutable (`@Volatile var`) and reset it under `synchronized(initLock)` whenever state flips into IN_PROGRESS (both sync `initWithContext(ctx, appId)` and suspend `initWithContextSuspend`). Both await sites local-capture the deferred under the lock so they wait on the same generation they observed -- never on a stale one. Defect 2 (indefinite hang if internalInit throws): internalInit had no try/catch wrapping its body. An unchecked throw from initEssentials/bootstrapServices/subscribeToConfigStore/updateConfig/ userSwitcher.initUser/startupService.scheduleStart would leave initState at IN_PROGRESS and suspendCompletion uncompleted forever. With the new await() path introduced in PR #2637, every re-entrant suspend caller (SyncJobService) would hang on await() until the OS killed the worker. Wrap internalInit's body in try/catch: on any throw, attach the cause to initFailureException, set initState = FAILED, call notifyInitComplete(), and return false instead of rethrowing. Guarantees a terminal state and released waiters on every code path. Two new regression tests in SDKInitTests: - "resets latch on retry-after-FAILED": stalls a retry after a forced FAILED first init, kicks off a re-entrant caller, asserts it doesn't wake on the stale latch (verified to fail on the un-hardened code with expected:<true> but was:<false>). - "reaches terminal state when internalInit throws": forces a throw inside internalInit, asserts the suspend init returns false and the SDK can retry cleanly afterwards (verified to fail on the un-hardened code with the throw escaping as RuntimeException). Co-authored-by: Cursor <cursoragent@cursor.com>
- SessionService.startTime: return `_time.currentTimeMillis` (~0ms elapsed) instead of `0L` (~58 years elapsed) when bootstrap hasn't run, so call sites computing `now - startTime` (IAM session-duration / SESSION_TIME triggers) get a sensible default rather than Jan 1970 deltas. (Copilot review comment) - initWithContextSuspend: only assign `initFailureException` when the call actually starts init (`shouldRunInit == true`). Re-entrant callers no longer overwrite the original initiator's failure-attribution stack trace. (Copilot review comment) - SDKInitTests: bound the `trigger.await()` inside the in-test ContextWrapper overrides to 30s so that a logic bug elsewhere in the test (or a cancelled coroutine) can't deadlock the IO worker forever and stall the suite. (Copilot review comment) Co-authored-by: Cursor <cursoragent@cursor.com>
|
Branch unstacked from #2605 + force-pushed. The branch was previously rebased on top of #2605 ( What changed:
Local verification:
When #2605 lands on |
|
@claude review |
| } catch (t: Throwable) { | ||
| // Any unchecked throw from initEssentials/bootstrapServices/subscribeToConfigStore/ | ||
| // updateConfig/userSwitcher.initUser/startupService.scheduleStart would otherwise leave | ||
| // initState at IN_PROGRESS forever and `suspendCompletion` uncompleted -- causing | ||
| // re-entrant suspend callers (e.g. SyncJobService) to hang indefinitely on `await()`. | ||
| // Always reach a terminal state and signal waiters before propagating. | ||
| Logging.error("OneSignal: internalInit threw unexpectedly; marking init FAILED", t) | ||
| initFailureException = (initFailureException ?: IllegalStateException("OneSignal initWithContext failed.")).apply { | ||
| addSuppressed(t) | ||
| } | ||
| initState = InitState.FAILED | ||
| notifyInitComplete() | ||
| return false | ||
| } |
There was a problem hiding this comment.
🔴 Two TOCTOU races on the latch lifecycle re-introduce the very SDK-4475 failure modes this PR is fixing.\n\n**(1) Completer-side wrong-latch race.** notifyInitComplete() (L484) reads the volatile suspendCompletion at .complete() time with no lock and no local capture. All three FAILED paths in internalInit (L359-360, L379-380, L402-403) execute initState = FAILED and notifyInitComplete() as two unsynchronized statements. A concurrent retry can land between them, acquire initLock, observe FAILED → !isSDKAccessible, flip state→IN_PROGRESS and reset suspendCompletion = L2. The original thread then completes L2 (the wrong generation) instead of L1 — L1 waiters hang forever (the SyncJobService budget-slot leak this PR was meant to close), L2 waiters wake on transient IN_PROGRESS state.\n\n**(2) Awaiter-side stale-state race.** After completionToAwait!!.await() returns at L536 / L703, both call sites re-read live volatile initState outside any lock (L548 / L704). Coroutine continuation reschedule is non-trivial, and a concurrent retry-after-FAILED can flip state during that gap. In initWithContextSuspend this returns false based on transient IN_PROGRESS (silent JobService work drop); in waitUntilInitInternal the L548 FAILED check fails, control falls through past "// initState is guaranteed to be SUCCESS here" and the caller invokes getter() against a not-yet-bootstrapped service — exactly the SessionService-NPE class.\n\nBoth PR regression tests serialize the FAILED→retry sequence so neither covers the concurrent window. A single fix addresses both: change the latch payload to CompletableDeferred<InitState>, have completer paths call .complete(state) on a local captured at the start of internalInit, and have awaiter sites return based on the awaited result rather than re-reading the volatile field.
Extended reasoning...
Summary
The @volatile var + reset-under-lock + local-capture-at-await-site pattern landed in 4101068 closes the PR's targeted failure mode but leaves two unaddressed TOCTOU windows on the same latch lifecycle. Both windows are reachable via realistic host-app + SyncJobService interleavings, and both re-introduce the contract violations this PR was specifically designed to fix.
(1) Completer-side wrong-latch race
notifyInitComplete() (OneSignalImp.kt:483-485) is:
private fun notifyInitComplete() {
suspendCompletion.complete(Unit)
}It reads the @Volatile field at .complete() time. The await sites local-capture under initLock, but the completer side does not — so it can complete a different generation of the latch than the one it was supposed to release. The non-atomic pair initState = FAILED + notifyInitComplete() appears at three sites:
- L358-361 (user-locked branch)
- L378-381 (missing appId branch)
- L402-403 (catch block)
All three live outside any synchronized block. isSDKAccessible() returns false for FAILED, so a concurrent caller landing in the synchronized block at L678-693 (or L305-315) is allowed to transition state and reset the latch.
Step-by-step proof
Initial state: initState = NOT_STARTED, suspendCompletion = D0.
- T_A calls
initWithContextSuspend(ctx, "appId"). Synchronized block flipsinitState = IN_PROGRESS, allocatessuspendCompletion = D1. Releases lock. BeginsinternalInit. - T_C (re-entrant suspend caller, e.g.
SyncJobService.onStartJob → initWithContext(this) → initWithContextSuspend(ctx, null)) enters the synchronized block at L678. ObservesisSDKAccessible() == true(state is IN_PROGRESS), capturescompletionToAwait = suspendCompletion = D1. Releases lock. Suspends onD1.await()at L703. - T_A:
internalInitbody throws (e.g. insidebootstrapServicesiterating registeredIBootstrapServiceimplementations). Catch block executes at L398-404. Reaches L402:initState = InitState.FAILED. ⏸️ (thread preemption, GC pause, etc.) - T_B (separate thread, e.g. host-app retry
OneSignal.initWithContext(ctx, "appId")) enters the synchronized block at L305 (sync path) or L678 (suspend path). ObservesinitState == FAILED,!isSDKAccessible(). FlipsinitState = IN_PROGRESS, allocatessuspendCompletion = D2. Releases lock. Spawns its owninternalInit. - T_A: resumes at L403 →
notifyInitComplete()→ reads@Volatile suspendCompletion= D2 →D2.complete(Unit).
Outcomes
- D1 is never completed. T_C hangs on
D1.await()forever — the exact SyncJobService budget-slot leak this PR was meant to prevent. - D2 is completed prematurely. Any waiter that captures D2 after T_B's reset wakes immediately. In
initWithContextSuspendit returnsinitState == SUCCESSagainst transient IN_PROGRESS →false, silently dropping JobService work. InwaitUntilInitInternalthe L548 FAILED check fails, the function falls through past the misleading "// initState is guaranteed to be SUCCESS here" comment, and the caller proceeds togetter()against a service whose retry-init bootstrap has not run.
The SUCCESS path at L389-390 is safe because state == SUCCESS keeps isSDKAccessible() == true, so no concurrent caller can reset the latch.
(2) Awaiter-side stale-state race
The symmetric defect is on the awaiter side. After completionToAwait!!.await() returns, both call sites re-read live volatile initState outside the lock:
waitUntilInitInternal: L536 (await) → L548 (if (initState == InitState.FAILED))initWithContextSuspend: L703 (await) → L704 (return@withContext initState == InitState.SUCCESS)
The gap between latch completion and continuation resumption depends on coroutine dispatcher scheduling and can be substantial. A concurrent retry-after-FAILED during that gap flips state from FAILED to IN_PROGRESS, defeating the post-await check. This race exists even when notifyInitComplete() correctly completes the right latch — i.e. it is independent of the bug above.
For initWithContextSuspend, returning false on transient IN_PROGRESS is a silent JobService work drop. For waitUntilInitInternal, falling through past the FAILED check on transient IN_PROGRESS is the SessionService-NPE class — the caller invokes getter() against services whose retry init has not bootstrapped (configModel.appId not set, userSwitcher.initUser not called, startupService.scheduleStart not run).
Why the PR's regression tests don't cover this
initWithContextSuspend resets latch on retry-after-FAILEDserializes:firstInit shouldBe falseruns to completion before the second init starts.initWithContextSuspend reaches terminal state when internalInit throwsis single-threaded.- The original
in-flight init waits for completiontest does not exercise FAILED retry at all.
Neither test launches a thread between initState = FAILED and notifyInitComplete(), and none launch a re-entrant caller between await() returning and the post-await state read.
Suggested fix
A single change addresses both races:
- Change the latch payload to encode the terminal state:
CompletableDeferred<InitState>. - At the start of
internalInit, captureval myCompletion = suspendCompletion(under or just after the lock that reset it). - All terminal-state paths call
myCompletion.complete(InitState.FAILED)ormyCompletion.complete(InitState.SUCCESS)on the captured local — never on the volatile field. - Awaiter sites use the awaited result (
val terminalState = completionToAwait!!.await()) instead of re-reading volatileinitState.
This pattern eliminates both windows by making the terminal state of each generation observable as the latch payload, decoupling it from the live volatile field that concurrent retries mutate.
Alternatively, holding initLock across initState = FAILED and notifyInitComplete() closes window (1) but does not close window (2); window (2) requires either the result-in-payload pattern or a loop that re-captures and re-awaits while observed state is IN_PROGRESS.
| } | ||
|
|
||
| initState = InitState.IN_PROGRESS | ||
| // Fresh latch for this init attempt -- prevents retry-after-FAILED waiters from | ||
| // observing the prior init's already-completed deferred. | ||
| suspendCompletion = CompletableDeferred() | ||
| } | ||
|
|
||
| // FeatureManager depends on ConfigModelStore/PreferencesService which requires appContext. |
There was a problem hiding this comment.
🔴 Sync initWithContext(context, appId) resets suspendCompletion and sets initState=IN_PROGRESS inside the synchronized block (lines 305-315), then calls ensureApplicationServiceStarted(context) at line 319 outside any try/catch — the new try/catch only wraps internalInit's body. If that call throws (e.g. ApplicationService.start line 81's context.applicationContext as Application cast on a non-Application context), initState is leaked at IN_PROGRESS and the just-reset latch is never completed, causing every re-entrant suspend caller (initWithContextSuspend, waitUntilInitInternal) to hang on await() indefinitely — the exact inverted failure mode this PR's internalInit catch block was designed to prevent. Fix by mirroring the same try/catch around lines 317-332 (or moving ensureApplicationServiceStarted inside internalInit's protected body).
Extended reasoning...
What the bug is
The PR added a try/catch around internalInit's body (with a comment explicitly naming the hang failure mode it prevents) and a fresh-latch reset under synchronized(initLock) at the start of every init attempt. But on the sync initWithContext(context, appId) entry path, there is unprotected code between the latch reset and the internalInit call. That gap re-introduces the very hang the catch block was designed to prevent.
Code path
OneSignalImp.kt lines 297-333:
override fun initWithContext(context: Context, appId: String): Boolean {
synchronized(initLock) {
if (initState.isSDKAccessible()) { return true }
initState = InitState.IN_PROGRESS
suspendCompletion = CompletableDeferred() // L1: fresh latch
}
ensureApplicationServiceStarted(context) // <-- line 319, OUTSIDE try/catch
if (isBackgroundThreadingEnabled) {
suspendifyOnIO { internalInit(context, appId) } // catch lives inside internalInit
return true
}
return runBlocking(runtimeIoDispatcher) { internalInit(context, appId) }
}ensureApplicationServiceStarted reaches ApplicationService.start(context) (ApplicationService.kt line 78), which does:
val application = context.applicationContext as Application // line 81That cast throws ClassCastException for any Context whose applicationContext is not an Application instance — uncommon in production but realistic for hosts initializing from ContentProvider contexts (some apps do this for pre-onCreate init), test/instrumentation harnesses that wrap or mock Context, or restricted/launcher contexts that return a ContextWrapper.
Why existing code does not prevent it
The PR's new try { ... } catch (t: Throwable) { ... initState = FAILED; notifyInitComplete() ... } is scoped to internalInit's body (OneSignalImp.kt lines ~360-405). It does not wrap the lines between the synchronized block and the internalInit invocation in the sync entry path. The PR comment on that catch even enumerates the exact symptom — "would otherwise leave initState at IN_PROGRESS forever and suspendCompletion uncompleted -- causing re-entrant suspend callers (e.g. SyncJobService) to hang indefinitely on await()" — but the protection is one stack frame too narrow.
isBackgroundThreadingEnabled itself is already defensively wrapped (catches Throwable from the feature-manager lookup), so the only realistic throw point in the unprotected window is ensureApplicationServiceStarted.
Step-by-step proof
- Host
MainApplication.onCreate()callsOneSignal.initWithContext(weirdContext, "appId")whereweirdContext.applicationContextis aContextWrapper(not anApplication). - Synchronized block (lines 305-315):
initState = IN_PROGRESS,suspendCompletion = L1(fresh). - Line 319:
ensureApplicationServiceStarted(weirdContext)→(applicationService as ApplicationService).start(weirdContext)→val application = weirdContext.applicationContext as ApplicationthrowsClassCastException. - Throw escapes
initWithContext. SDK state:initState=IN_PROGRESS,suspendCompletion=L1uncompleted,applicationServiceStarted=false. Host catches the throw (or the process keeps running because Application onCreate exceptions are sometimes survivable, or a persisted JobService entry outlives the throwing process invocation). - Later,
SyncJobService.onStartJobfires and callsOneSignal.initWithContext(this)(suspend, no-appId) →initWithContextSuspend(ctx, null). - Synchronized block at lines 678-693:
isSDKAccessible()istrue(IN_PROGRESS counts).shouldRunInit=false, capturescompletionToAwait = suspendCompletion = L1. - Line 703:
completionToAwait!!.await()— blocks forever on L1, which no code path will ever complete because nointernalInitwas invoked for this generation. SyncJobServicecoroutine hangs until the OS kills the worker, holding its JobService budget slot — the same hang failure mode the PR's catch block was supposed to prevent.
The same hang affects waitUntilInitInternal (line 533): any subsequent accessor call (os.user, os.notifications, etc.) under SDK_BACKGROUND_THREADING blocks forever in runBlocking { waitUntilInitInternal() } -> await() on L1.
Why this is a regression introduced by this PR
- Pre-PR:
suspendCompletionwas a single-shotval;initWithContextSuspenddid notawait()on it. A re-entrant suspend caller observingisSDKAccessible() == trueearly-returnedtruewithout awaiting (the SDK-4475 NPE this PR fixes). A leaked latch was harmless because nothing awaited. - Post-PR: re-entrant suspend callers now
await()on the freshly-reset latch (line 703), andwaitUntilInitInternaldoes the same (line 533). A leaked latch from a pre-internalInitthrow turns the prior NPE-on-stale-state into an indefinite hang. The bug class shifts from "returns too early" to "never returns" — exactly the inversion the PR'sinternalInitcatch was designed to prevent.
Impact
Trigger probability is narrow (most hosts pass an Activity / Application / Service context whose applicationContext is the Application), but the consequence is severe and matches the precise hang the PR is attempting to close. SyncJobService callers hold JobService budget slots until the OS terminates the worker; accessor calls under SDK_BACKGROUND_THREADING will block forever. Importantly, the PR's defensive intent is explicit in the new catch comment, so this is a literal gap in the same defensive pattern rather than a hypothetical future hardening.
Suggested fix
Mirror the internalInit catch around the sync-entry body. Two equivalent options:
Option A — wrap the post-lock body:
synchronized(initLock) { ... }
try {
ensureApplicationServiceStarted(context)
if (isBackgroundThreadingEnabled) {
suspendifyOnIO { internalInit(context, appId) }
return true
}
return runBlocking(runtimeIoDispatcher) { internalInit(context, appId) }
} catch (t: Throwable) {
initFailureException = (initFailureException ?: IllegalStateException(...)).apply { addSuppressed(t) }
initState = InitState.FAILED
notifyInitComplete()
throw t
}Option B — move ensureApplicationServiceStarted into internalInit so it falls under the existing try/catch. Would need to preserve the order with the isBackgroundThreadingEnabled evaluation (which currently relies on applicationServiceStarted being true to avoid returning false prematurely).
| initState = InitState.IN_PROGRESS | ||
| // Fresh latch for this init attempt. | ||
| suspendCompletion = CompletableDeferred() | ||
| // Only the call that actually starts init owns the failure-attribution exception. | ||
| // Re-entrant callers must not overwrite it -- otherwise the failure stack trace | ||
| // would point at the SyncJobService coroutine instead of the original initiator. | ||
| initFailureException = IllegalStateException("OneSignal initWithContext failed.") |
There was a problem hiding this comment.
🟡 Nit / pre-existing-style: The fix to gate initFailureException assignment behind shouldRunInit (responding to Copilot 3196748195) inadvertently moved the IllegalStateException(...) constructor inside withContext(runtimeIoDispatcher). The field's declaration comment at OneSignalImp.kt:62 still says "Save the exception pointing to the caller that triggered init, not the async worker thread," and the deleted comment at the old assignment site ("This ensures the stack trace points to the caller that triggered init, not the async worker thread") spelled out the same intent — both are now technically violated. Practical impact is small (the message + suppressed cause carry the real diagnostic info, and on suspend paths the "caller frames" were always coroutine continuation frames anyway), but if you want to honor the documented invariant the construction can be hoisted above withContext while the assignment stays under the lock — i.e. val newFailureEx = IllegalStateException(...) outside, then initFailureException = newFailureEx inside the shouldRunInit branch.
Extended reasoning...
What changed
Pre-PR (per the diff context), initWithContextSuspend constructed and assigned initFailureException immediately on entry, before withContext(runtimeIoDispatcher):
// pre-PR
initFailureException = IllegalStateException("OneSignal initWithContext failed.")
// This ensures the stack trace points to the caller that triggered init, not the async worker thread.
return withContext(runtimeIoDispatcher) { ... }Post-PR (OneSignalImp.kt:691), the assignment lives inside the shouldRunInit branch of the synchronized(initLock) block — which is itself inside withContext(runtimeIoDispatcher). The constructor therefore now runs on an IO worker continuation, and the captured stack trace contains IO dispatcher frames rather than the caller's coroutine continuation frames. The pre-PR comment that documented the invariant at the assignment site was deleted, but the field-level comment at OneSignalImp.kt:62 still asserts it: Save the exception pointing to the caller that triggered init, not the async worker thread.
Why the change happened
This was a deliberate response to Copilot inline-comment 3196748195: initFailureException is overwritten at the start of every initWithContextSuspend call, even when this invocation does not actually start init. The fix correctly gates the assignment so re-entrant suspend callers (notably the SyncJobService entry point) no longer overwrite the original initiator's failure-attribution exception. That part is sound. The accidental side effect is that hoisting the assignment under the lock also relocated the constructor into the IO continuation.
Practical impact (why this is a nit, not a blocker)
- The substantive diagnostic info is preserved. The exception's message (
OneSignal initWithContext failed.) is unchanged, and the suppressed cause attached ininternalInit(theaddSuppressedpaths at lines ~691 and ~702) carries the actual failure-site stack into the bootstrap path. Crash reporters and debuggers surfacing this exception still get the meaningful information. - Pre-PR, the "caller's stack" guarantee was already weak on the suspend path.
initWithContextSuspendis a suspend function — its callers are already in a coroutine, so the JVM stack at the constructor call site (whether before or afterwithContext) is dominated by continuation/dispatcher machinery; the user's actual code site (e.g.launch { os.initWithContextSuspend(...) }) is not present at any point. The deleted comment overstated what was actually preserved. - The typical sync
initWithContext(ctx, appId)path is unaffected — that path doesn't touchinitFailureExceptionat all (only the catch block ininternalInitdoes, on whichever runner thread). - The existing test
accessor instances after initWithContext without appID shows the failure reason(SDKInitTests.kt:75-83) only asserts on.messageandsuppressed[0].message, not on the stack trace, so the regression is not test-detected.
Step-by-step proof of the regression
- Host calls
os.initWithContextSuspend(context)from inside a coroutine on the Main dispatcher (or fromrunBlockingon a non-coroutine thread). - Execution enters
initWithContextSuspendbody. Pre-PR,IllegalStateException(...)is constructed here — its captured stack contains the coroutine's continuation frames on the Main dispatcher (or the runBlocking thread). Post-PR, no construction happens yet. withContext(runtimeIoDispatcher)switches to an IO dispatcher worker. Post-PR, the synchronized block runs here, andIllegalStateException(...)is constructed now — its captured stack contains IO worker continuation frames.- Init eventually fails. Some later
waitUntilInitInternalcaller throwsinitFailureException. Pre-PR: stack trace shows Main dispatcher / runBlocking-thread continuation. Post-PR: stack trace shows IO worker continuation. - Both stacks are coroutine machinery; neither cleanly points at the host app's call site. But pre-PR was at least consistent with the documented invariant of "caller's dispatcher," whereas post-PR is unambiguously "IO worker."
Suggested fix (if you want to honor the documented invariant)
Hoist the constructor above withContext and only the assignment under the lock. The two concerns (no-overwrite and caller-dispatcher capture) are orthogonal:
override suspend fun initWithContextSuspend(...): Boolean {
Logging.log(...)
// Capture stack on caller's dispatcher.
val newFailureEx = IllegalStateException("OneSignal initWithContext failed.")
return withContext(runtimeIoDispatcher) {
synchronized(initLock) {
if (initState.isSDKAccessible()) { ... }
else {
initState = InitState.IN_PROGRESS
suspendCompletion = CompletableDeferred()
initFailureException = newFailureEx // assignment under lock; stack captured outside withContext
}
}
...
}
}Or alternately: drop the field-level comment to match what's actually preserved (message + suppressed cause), since the "caller frames" claim was already largely aspirational on a suspend path. Either fix is fine; the current code is internally inconsistent.
| // existing detekt baseline entry was for the unannotated signature; once we added | ||
| // @Suppress(TooGenericExceptionCaught) the baseline ID no longer matched, so we | ||
| // suppress ReturnCount inline rather than churning the baseline. |
There was a problem hiding this comment.
nit, autogenerated comments can be quite verbose and not that relevant, and I'm quite guilty of this too
Description
One Line Summary
Closes a race under
SDK_BACKGROUND_THREADINGwhereSyncJobServicere-entering the internal suspendinitWithContextreturnedtruebefore bootstrap finished, causingSessionService.backgroundRun()to NPE on a still-nullsessionfield.Linear: SDK-4475
Details
Motivation
Production stack (only seen with
SDK_BACKGROUND_THREADINGenabled):The race:
OneSignal.initWithContext(context, appId). Under the FF this flipsinitStatetoIN_PROGRESSand runsinternalIniton a fire-and-forget IO coroutine, then returns immediately.internalInit(beforebootstrapServices()has calledSessionService.bootstrap()),SyncJobService.onStartJobfires on a separate IO worker.SyncJobServicecallsOneSignal.initWithContext(this)(the internal suspend overload, no-appId).initWithContextSuspendsawinitState.isSDKAccessible() == true(becauseIN_PROGRESSis "accessible") and returnedtruewithout waiting for bootstrap to finish — violating its own documented contract: "Remain suspend until initialization is fully completed."SyncJobServiceproceeded toOneSignal.getService<IBackgroundManager>().runBackgroundServices(), which loops throughIBackgroundServices and callsbackgroundRun()onSessionServicewhosesessionfield was stillnull. NPE onif (!session!!.isValid).With FF off, public
initWithContextrunsrunBlocking { internalInit(...) }synchronously — bootstrap is always finished before init returns. The window doesn't exist.Scope
OneSignalImp.initWithContextSuspend— when init is already in flight, now suspends onsuspendCompletion.await()until it actually completes, then returns based on the final state (SUCCESS→ true,FAILED→ false). Honors the documented contract.initWithContext(context, appId)— intentionally unchanged. The fire-and-forget under FF-on is the entire point of the FF (no ANR on host app'sMainApplication.onCreate()). The wait belongs on the suspend re-entry path, where the caller (SyncJobService) is already on a background coroutine and depends on the SDK being fully ready for its very next line.SessionService— defensive null guards inendSession/onFocus/onUnfocused/startTime/scheduleBackgroundRunInso any future caller that bypassesbootstrap()no-ops instead of crashing here.Wait duration
Bounded by actual
internalInitcost —initEssentials+bootstrapServices+resolveAppId+userSwitcher.initUser. Tens to hundreds of ms in practice, well withinJobServicebudgets.Relationship to #2605
#2605 addresses the same general class (callers acting while
IN_PROGRESS) but at the accessor side (getServiceWithFeatureGateblocks onIN_PROGRESS). It does not close this NPE becauseSyncJobServicereachesIBackgroundManagervia rawgetService<T>()(IServiceProvider), which doesn't go through the accessor gate. The two fixes are complementary; merge order is independent.Testing
Unit testing
Added
SDKInitTests."initWithContextSuspend with in-flight init waits for completion before returning":initWithContextSuspendinsideinternalInitvia a customBlockingPrefsContextthat signals (viaCountDownLatch) whengetSharedPreferencesis first touched — by which pointinitStateis deterministicallyIN_PROGRESS.initWithContextSuspendand capturesos.isInitializedat the exact moment that second call returns.os.isInitialized == trueat return — i.e. the second call did not return until init was fully complete.Verified locally:
OneSignalImp.ktreverted): test FAILS withexpected:<true> but was:<false>— exactly catches the bug.Full
:OneSignal:core:testReleaseUnitTestrun: 801 tests, only the 2 unrelated pre-existingSDKInitTestsfailures remain onmain(those are what #2605 addresses); confirmed pre-existing by stashing this PR and re-running. AllSDKInitSuspendTests(11) andSessionListenerTests(3) pass.Manual testing
Not directly reproducible at will without instrumentation (race window depends on JobService timing). Repro path is documented above; the regression test deterministically reproduces the contract violation.
Affected code checklist
Checklist
Overview
Testing
SDKInitTestsfailures remain — unrelated, addressed by fix: resolve pre-existing test failures on main #2605)Final pass
Made with Cursor